Abstract

We consider the statistical mechanics of a full set of two-dimensional protein-like heteropolymers,
whose thermodynamics is characterized by the coil-to-globular ($T_\theta$) and the folding ($T_f$)
transition temperatures. For our model, the typical time scale for reaching the unique native
conformation is shown to scale as $\tau_f \sim F(M) \exp(\sigma/\sigma_0)$, where
$\sigma = 1 - T_f/T_\theta$, $M$ is the number of residues, and $F(M)$ scales algebraically with $M$.
We argue that $T_f$ scales linearly with the inverse of the entropy of low energy non-native states,
whereas $T_\theta$ is almost independent of it. As $\sigma \to 0$, non-productive intermediates
decrease, and the initial rapid collapse of the protein leads to structures resembling the native
state. Based solely on accessible information, $\sigma$ can be used to predict sequences that fold
rapidly.
An apparent puzzle in protein folding kinetics was raised by Levinthal in the late sixties. He
argued that since the number of conformations of even a moderate sized protein is astronomically 
large, it is unlikely that a polypeptide chain can find the lowest free energy conformation (referred 
to as the native state) in biologically relevant time scales. We note that the time scale in which 
proteins fold in cells is several (twelve or more) orders of magnitude longer than microscopic time 
scales. The belief that proteins find the global free energy minima in times on the order of seconds 
led Levinthal to postulate that there must be "preferred pathways" that direct the folding process. 
Minimal protein models, which capture some but not all of the features considered to be important 
in proteins, have in recent years been used to provide plausible resolutions to this seeming paradox 
0, ^, |], pj. These scenarios have many common elements but differ significantly in detail. The 
unifying idea that has emerged from these studies is that in order to quantitatively describe folding 
kinetics (at least in these models) one has to contend with complex energy landscapes. 

A few years ago we showed using simple two dimensional lattice models that, in general, quasi 
random sequences reach their native conformation by a three stage multipathway kinetics [||, ||, |(| . 
In the first stage the chain collapses from a random coil to a compact state. In the second stage
the chain searches among the set of compact structures, by a diffusive, reptation-like mechanism, to
reach one of the minimum energy structures. The final stage involves an activated transition from one
of the minimum energy structures to the native conformation. Recently, tentative estimates of the
time scales for the various processes in the three stage kinetics have been suggested in terms of $M$,
the number of amino acid residues in a protein, and other experimentally controllable parameters
|q, H. These estimates support our earlier assertion that this three stage kinetics provides a 
resolution to the Levinthal paradox for single domain proteins which have typically M less than 
about two hundred. The reason that folding of these single domain proteins occurs in time scales 
on the order of seconds is that the average free energy barrier in the third stage scales only as $\sqrt{M}$
and not as M as had been supposed by others ||. 

With the Levinthal paradox resolved by the three stage multipathway kinetics [|| for quasiran- 
dom sequences, it is natural to address the following question: For a given value of M is there an 
intrinsic property of the sequence that essentially determines the folding time? In this paper we
use a simple model, belonging to the class of HP models proposed by Dill and collaborators [10], to
answer in the affirmative the question raised above. We had conjectured earlier || that sequences that
fold rapidly are characterized by having the coil-to-globular (collapse) transition temperature, $T_\theta$,
and the folding temperature, $T_f$, in close proximity. In particular, the parameter

$$\sigma = (T_\theta - T_f)/T_\theta \qquad (1)$$

can be used to classify the kinetic accessibility of the native conformation. In this letter, we present
quantitative estimates of folding times for a number of sequences spanning a range of $T_\theta$ and $T_f$
that explicitly verify our earlier conjecture. This result implies that kinetic accessibility of the
native state may in fact be encoded in the primary sequence of proteins.

The aforementioned prediction can be interpreted in the context of the refolding of a protein-like
structure from an unfolded conformation ($T > T_\theta$) to a folded native-like structure ($T \lesssim T_f$). It
is reasonable to suggest that $\sigma$ probes the key role played by the "molten globular" states in the
folding dynamics. (a) For large $\sigma$, the dynamical process involves a detailed sampling of transient
globular states in a rough free energy landscape ||, [7], ||], which naturally slows down the folding
kinetics. (b) For small $\sigma$, on the other hand, the collapse and folding occur almost simultaneously,
with the chain collapsing almost directly into a folded structure. This was the rationale for
our earlier prediction [H]: the smaller the value of $\sigma$, the shorter the folding time scale $\tau_f$.
The polypeptide chain is modelled as a two letter self-avoiding walk on a two dimensional
square lattice [10]. Each bead on the chain can be either hydrophobic (H) or hydrophilic (P).
When two non-bonded H beads are nearest neighbors, the interaction energy is $-\epsilon < 0$.
All other interactions are zero. The temperatures $T_\theta$ and $T_f$ are computed exactly using series
enumeration of the finite-size chains (see, e.g., H, ||], pi], |ll|). The simulation results have been
obtained by using single step Monte Carlo dynamics and a Metropolis algorithm. All time scales
are measured in Monte Carlo steps (MCS). Allowed moves are such that they mimic basic features
of real chain dynamics, i.e. they preserve chain connectivity and excluded volume interactions [12].
The sequences studied are such that the ratio of the number of hydrophobic sites $M_H$ to hydrophilic
sites $M_P$ is set close to its optimal folding value of one, and they have a unique ground state, assumed
to be the native state. Thus, we restrict our analysis to (thermodynamically) foldable sequences.
To free our statistical analysis from biased sampling, we consider all possible sequences with the
aforementioned properties, for protein sizes $M = 15$ with $M_H = 8$ (214 sequences) and
$M = 18$ with $M_H = 10$ (1326 sequences).
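
The exact enumeration described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`hp_energy`, `enumerate_walks`, `density_of_states`, `energy_fluctuation`) are hypothetical, the first bond is fixed along +x to remove rotations (mirror images are still counted separately), and $k_B$ is set to one. It builds the density of states of an HP sequence and the energy fluctuations whose peak is used in the text to locate the collapse temperature.

```python
from collections import Counter
from math import exp

# Unit steps on the square lattice.
STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))


def hp_energy(sequence, coords, eps=1.0):
    """Energy of one conformation: -eps per non-bonded nearest-neighbor H-H pair."""
    occupied = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in STEPS:
            j = occupied.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips bonded neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -eps * contacts


def enumerate_walks(n_beads):
    """Yield every self-avoiding walk of n_beads >= 2 sites, first bond along +x."""
    def grow(coords, visited):
        if len(coords) == n_beads:
            yield tuple(coords)
            return
        x, y = coords[-1]
        for dx, dy in STEPS:
            nxt = (x + dx, y + dy)
            if nxt not in visited:
                coords.append(nxt)
                visited.add(nxt)
                yield from grow(coords, visited)
                coords.pop()
                visited.remove(nxt)

    start = [(0, 0), (1, 0)]
    yield from grow(start, set(start))


def density_of_states(sequence, eps=1.0):
    """Number of conformations at each energy level (assumed helper)."""
    return Counter(hp_energy(sequence, walk, eps)
                   for walk in enumerate_walks(len(sequence)))


def energy_fluctuation(dos, temperature):
    """<E^2> - <E>^2 from Boltzmann weights, with k_B = 1."""
    e_min = min(dos)  # shift energies to avoid overflow at low temperature
    weights = {e: g * exp(-(e - e_min) / temperature) for e, g in dos.items()}
    z = sum(weights.values())
    mean = sum(e * w for e, w in weights.items()) / z
    mean_sq = sum(e * e * w for e, w in weights.items()) / z
    return mean_sq - mean ** 2
```

For the four-bead chain `"HPPH"` this enumeration gives nine walks, two of which (a U-shaped fold and its mirror image) carry one H-H contact. Scanning `energy_fluctuation` over a temperature grid and taking the maximum is one simple way to estimate the peak position under these assumptions.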

The time scale $\tau_f$ was measured by fitting and averaging the exponential decay ($\exp[-t/\tau_f]$)
of the long-time deviation from equilibrium of several correlation functions after a temperature
quench from a high temperature unfolded structure to the folding temperature $T_f$. For details see
Ref. ||. One of the correlation functions used is the overlap function $\langle\chi(t)\rangle$, which depends on all
distances $r_{ij}$ between sites, with $i$ and $j$ indicating the site index along the chain. This function is a
useful probe of the folding kinetics |J, and is defined as follows

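$$\langle\chi(t)\rangle = 1 - \frac{1}{M^2 - 3M + 2} \left\langle \sum_{i \neq j, j \pm 1} \delta\left(r_{ij}(t) - r_{ij}^N\right) \right\rangle . \qquad (2)$$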

It measures structural differences between fluctuating conformations and the ground (or "native")
state, denoted by the superscript $N$. ($\langle\chi\rangle$ varies between 1 for a fully non-native structure and
0 for the pure native state. Random overlap of two structures amounts to $\langle\chi\rangle = 0.735$ for $M = 15$.)
We denote as non-native those structures with at least one topological feature different
from the native state. The folding temperature $T_f$ is defined as the temperature at which the
fluctuations $\Delta\chi = \langle\chi^2\rangle - \langle\chi\rangle^2$ show a peak. Below $T_f$ conformations are mostly native, whereas
above $T_f$ they are non-native. For a protein model with short range interactions, the standard
definition of $T_\theta$ is the temperature at which the energy fluctuations ($\langle E^2\rangle - \langle E\rangle^2$) peak; these
fluctuations are trivially related to the specific heat, which was used to define $T_\theta$ in Ref. |J. For
the cases cited in ||, $\sigma \approx (0.50, 0.63, 0.088)$, whereas $\tau_f \approx (40, 230, 1) \times 10^5$ MCS, for models A, B
and C, respectively. This apparent correlation has also been recently observed in three dimensional
lattice simulations by Socci and Onuchic [13]. These authors suggested that reducing the average
energetic drive toward compactness may lead to a smaller difference $T_\theta - T_f$.
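
As an illustration of the overlap function of Eq. (2), the following Python sketch evaluates $\chi$ for a single lattice conformation against a native one. It assumes the reading of Eq. (2) in which $\chi$ is one minus the fraction of ordered non-bonded pairs $(i, j)$ whose distance equals the native distance, normalized by $M^2 - 3M + 2$; the function name `overlap` is hypothetical, and squared integer distances are compared so that the test for equality is exact on the lattice.

```python
def overlap(coords, native):
    """chi for one conformation: 0 for the native structure, 1 if no distance matches."""
    m = len(coords)
    if m != len(native) or m < 3:
        raise ValueError("both conformations need the same length, at least 3 beads")

    def dist_sq(conf, i, j):
        # Squared distance between beads i and j (an integer on the lattice).
        (xi, yi), (xj, yj) = conf[i], conf[j]
        return (xi - xj) ** 2 + (yi - yj) ** 2

    matches = 0
    for i in range(m):
        for j in range(m):
            # Skip i == j and bonded neighbors j = i +/- 1.
            if abs(i - j) > 1 and dist_sq(coords, i, j) == dist_sq(native, i, j):
                matches += 1
    # M^2 - 3M + 2 is the number of ordered non-bonded pairs.
    return 1.0 - matches / (m * m - 3 * m + 2)
```

Averaging `overlap` over conformations sampled at a given temperature gives $\langle\chi\rangle$, and the variance of the same samples gives the fluctuation whose peak defines $T_f$ in the text.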

The first correlation of interest is between $\sigma(T_\theta, T_f)$ and the folding time scale $\tau_f$. Figure 1
summarizes this analysis as a function of the parameter $\sigma$, showing: (a) histograms of $\sigma$ for the
space of sequences mentioned above; (b) the average overlap of the first excited states with respect
to the ground state, $\langle\chi\rangle_{1st}$; and (c) $\tau_f$ (the equilibration time scale at $T_f$) for 30 random sequences
with $M = 15$. (a) The histograms of $\sigma$ show broad distributions between $\sigma \approx 0.2$ and $0.7$. (b) The
average resemblance of the metastable states (first excited states) to the native structure decreases
as $\sigma$ increases. Indeed, for large $\sigma$ the overlap is close to that of random structures. Hence, as $\sigma$
increases, metastable states are further apart in configurational space. (c) Based on the above, it
is not surprising to find that $\tau_f$ varies by almost three orders of magnitude as a function of $\sigma$. We
find the scaling

$$\tau_f \sim F(M) \exp[\sigma(T_\theta, T_f)/\sigma_0] , \qquad (3)$$

which we predict to be universal. Although it was not checked here, we expect the prefactor
$F(M) \sim M^\lambda$ to also be a universal scaling function, with $\lambda \approx 3$; note the resemblance with Eq.
4 of Ref. Q, and Refs. [7], [14]. The constant $\sigma_0 \approx 0.11$ is a model dependent parameter.
The most important conclusion of this analysis is that fast folding sequences are characterized by
having small values of $\sigma$. Some limited experimental verification of this prediction has recently
been shown for fast folding cytochrome c [15], where folding and collapse have been found to be
almost synchronous.
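
The scaling of Eq. (3) can be turned into a small numerical sketch. The Python functions below are illustrative only: the names `sigma` and `folding_time` are hypothetical, and the default prefactor ($1.8 \times 10^5$ MCS) and $\sigma_0 = 0.11$ are the least-squares values quoted in the caption of Figure 1 for $M = 15$, so they should not be assumed to hold for other chain lengths or models.

```python
from math import exp


def sigma(t_theta, t_f):
    """Eq. (1): sigma = (T_theta - T_f) / T_theta."""
    if t_theta <= 0:
        raise ValueError("T_theta must be positive")
    return (t_theta - t_f) / t_theta


def folding_time(sig, prefactor=1.8e5, sigma0=0.11):
    """Eq. (3) with F(M) replaced by a fitted constant prefactor (in MCS).

    The defaults are the M = 15 fit quoted in the Figure 1 caption.
    """
    if sigma0 <= 0:
        raise ValueError("sigma0 must be positive")
    return prefactor * exp(sig / sigma0)
```

With these defaults, every increase of $\sigma$ by $\sigma_0 = 0.11$ multiplies the estimated folding time by a factor of $e$, which is why a modest spread in $\sigma$ translates into orders of magnitude in $\tau_f$.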

Since folding times correlate well with $\sigma$, it is instructive to find a relationship between $\sigma$ and
the energy spectrum. Theoretically, we expect $\mathcal{F}_N(T_f) \approx \mathcal{F}_{NN}(T_f)$, where $\mathcal{F}$ denotes free energy
and $N$ and $NN$ stand for native and non-native states, respectively. Hence, one can write the
following equation

$$T_f \approx \frac{\langle U_{NN}\rangle - \langle U_N\rangle}{\langle S_{NN}\rangle - \langle S_N\rangle} , \qquad (4)$$

where $\langle U\rangle$ and $\langle S\rangle$ denote internal energy and entropy. At $T_f$ it is reasonable to assume that the
leading contribution to the statistical averages comes largely from the low-lying states. Thus, we
can estimate $\langle U_{NN}\rangle - \langle U_N\rangle \approx \epsilon$, obtaining

$$\epsilon/k_B T_f \approx \ln(\Omega_{NN}/\Omega_N) , \qquad (5)$$

where $\Omega$ corresponds to the number of states, with $\Omega_N = 2$ (i.e. the ground state plus its mirror image),
and $k_B$ is the Boltzmann constant, hereafter set to one. We conclude that $1/T_f$ must depend linearly
on the "entropy" of non-native states. We expect (5) to be an excellent estimate for those sequences
with a sparse low energy spectrum.
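
A short Python sketch makes the estimate of Eq. (5) concrete. It assumes $k_B = 1$, reads Eq. (5) as $\epsilon/T_f \approx \ln(\Omega_{NN}/\Omega_N)$ with a native degeneracy of 2 (ground state plus mirror image), and uses hypothetical function names; it is a restatement of the formula, not an independent calculation of $T_f$.

```python
from math import log


def inverse_folding_temperature(omega_nn, omega_n=2):
    """eps / T_f from Eq. (5), with k_B = 1."""
    if omega_n <= 0 or omega_nn <= omega_n:
        raise ValueError("need omega_nn > omega_n > 0 for a positive T_f")
    return log(omega_nn / omega_n)


def folding_temperature(omega_nn, eps=1.0, omega_n=2):
    """T_f in units of eps / k_B, obtained by inverting Eq. (5)."""
    return eps / inverse_folding_temperature(omega_nn, omega_n)
```

Because the dependence is logarithmic, a tenfold increase in the number of low energy non-native states lowers $T_f$ only moderately, while $\epsilon/T_f$ grows by the fixed amount $\ln 10$.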

This prediction is in very good agreement with Fig. 2, where we show $\epsilon/T_f$ as a function of
the degeneracy of the first ($\Omega_1$), second ($\Omega_2$) and third ($\Omega_3$) excited states. For degeneracies larger than
30, $\epsilon/T_f$ scales almost linearly with the logarithm of $\Omega_1$; see the solid line in Fig. 2. The striking
agreement between the fit and (5) leads us to conclude that $\Omega_{NN} \approx \Omega_1$ for large $\Omega_1$. Although
noisier, a similar correlation is observed for higher energy levels as well. Degeneracies of the energy
levels grow exponentially, on average, by a model dependent factor of the order of 10 as the
energy increases by one. Clearly, this factor depends on the physical constraints of the connectivity
between states. These observations suggest a certain hierarchy and organization of the energy
landscape, with closely related energy levels. These correlations can be model dependent.

Also shown in Fig. 2 is the small but definite dependence of the inverse collapse temperature
$\epsilon/T_\theta$ on $\Omega_1$. Fast folding sequences have a somewhat lower $T_\theta$, suggesting that these sequences
have a smaller energetic drive toward compactness [13]. Figure 2 shows the link between the
broadening of $T_\theta - T_f$ and "molten globular" states. For clarity, we have also plotted the histograms
for $\Omega_1$, whose shapes are reminiscent of those in Fig. 1. As the ratio $M_H/M_P$ deviates from unity,
the number of excited states increases dramatically @, with $\sigma$ and $\tau_f$ increasing accordingly. Hence,
this analysis presents further evidence regarding the role played by intermediate states ($\Omega$) in the
folding dynamics, and the natural selection of proteins with an optimum content of hydrophobic
residues 0.

It is noteworthy that the energy gap between the native state and metastable states in this
model is always $\epsilon$. A trivial check of correlations between $\epsilon$ and $\tau_f$ shows that there is none. This,
of course, contradicts an earlier suggestion of Sali, Shakhnovich and Karplus 1C], who suggested
that the energy gap was enough to predict chain "foldicity", or ease of folding. The gap appropriately
divided by $T_f$ does seem to show a correlation with $\tau_f$ [Oj. This correlation, however, deteriorates
for fast folding sequences governed by entropy [18].

Goldstein et al. [1] proposed that rapid folding sequences are characterized by having a large
value of the ratio $T_f/T_g$, where $T_g$ is an equilibrium transition temperature. From a practical
point of view, however, this prediction, which appears to be supported by simulations pCp provided
$T_g$ is replaced by a kinetic glass transition temperature, is not very useful. Indeed, there is no
straightforward technique to measure the glass transition temperature, other than to estimate $T_g$
from a detailed knowledge of the unknown $\tau_f$.

It is quite clear that based on two thermodynamic parameters one cannot fully describe the 
folding dynamics in a complex energy landscape. The above notwithstanding, one can establish 
meaningful statistical relationships between equilibrium properties and a given set of dynamical 
rules. As long as these rules mimic essential features of the physical processes, the relationships 
should shed some light on the underlying mechanisms. 

We have shown that experimentally accessible information, namely $\sigma = 1 - T_f/T_\theta$, where $T_\theta$ is
the coil-to-globular and $T_f$ is the folding transition temperature, could be used to predict and design
fast folding sequences of proteins. For sequences that fold rapidly we predict that folding times $\tau_f$
should scale as suggested by (3). This expression embodies the interplay between energy frustration
and entropic barriers. It recovers the slow folding limit when the protein size $M$ increases and few
sequences fold fast. Our postulates are physically limited by their statistical nature. In particular,
averaging over sequence randomness entails standard deviations in $\tau_f$ of the order of half a decade.
From a microscopic point of view, $\sigma$ appears to be related to the entropy of low-lying states. Minimally
frustrated fast folding sequences with relatively small $\sigma$ fold in time scales mostly governed by
entropic considerations.
Figure Captions

1. Statistical analysis of $\sigma$ for non-degenerate sequences with $M = 15$ and $M = 18$ (see text).
(a) Bottom, histograms of $\sigma$: dashed and solid lines correspond to $M = 15$ and $M = 18$,
respectively. (b) Middle, average overlap between all first excited states and the ground state,
$\langle\chi\rangle_{1st}$. + and o symbols correspond to $M = 15$ and $M = 18$, respectively. Least square fit
shows a linear dependence of $\langle\chi\rangle_{1st}$ on $\sigma$ (solid line). (c) Top, folding time scale $\tau_f$ (□) as a
function of $\sigma$ for 30 random sequences with $M = 15$. Error bars are of the order of symbol
size. Least square fit yields $1.8 \times 10^5 \exp(\sigma/0.11)$ (solid line). Bars in (b) and (c) correspond
to one standard deviation.

2. Inverse of folding temperature as a function of the number of first ($\Omega_1$), second and third
excited states. Solid line corresponds to $\epsilon/T_f = \ln(\Omega_1/2)$. Symbols are as in Fig. 1b. To
better resolve the overlap of symbols, we show histograms of $\Omega_1$ for $M = 15$ (dashed line)
and $M = 18$ (solid line). Also shown is the inverse of the collapse temperature $\epsilon/T_\theta$ (×) as a function
of $\Omega_1$.